4 research outputs found

    A Scalable Near-Memory Architecture for Training Deep Neural Networks on Large In-Memory Datasets

    Get PDF
    Most investigations into near-memory hardware accelerators for deep neural networks have primarily focused on inference, while the potential of accelerating training has received relatively little attention so far. Based on an in-depth analysis of the key computational patterns in state-of-the-art gradient-based training methods, we propose an efficient near-memory acceleration engine called NTX that can be used to train state-of-the-art deep convolutional neural networks at scale. Our main contributions are: (i) a loose coupling of RISC-V cores and NTX co-processors reducing offloading overhead by 7 x over previously published results; (ii) an optimized IEEE 754 compliant data path for fast high-precision convolutions and gradient propagation; (iii) evaluation of near-memory computing with NTX embedded into residual area on the Logic Base die of a Hybrid Memory Cube; and (iv) a scaling analysis to meshes of HMCs in a data center scenario. We demonstrate a 2.7 x energy efficiency improvement of NTX over contemporary GPUs at 4.4 x less silicon area, and a compute performance of 1.2 Tflop/s for training large state-of-the-art networks with full floating-point precision. At the data center scale, a mesh of NTX achieves above 95 percent parallel and energy efficiency, while providing 2.1 x energy savings or 3.1 x performance improvement over a GPU-based system

    Variable delay ripple carry adder with carry chain interrupt detection

    No full text
    A statistical approach for the area efficient implementation of fast wide operand adders using early termination detection is described and analyzed. It is shown that high throughput can be achieved based on area- and routing-efficient ripple-carry adders with only marginal overhead. They share a low AT-product with Brent-Kung adders but provide designers with totally different area/delay tradeoffs. The circuit does not require full-custom design and fits well into both self-timed and synchronous designs

    Efficient ASIC implementation of a real-time depth mapping stereo vision system

    No full text
    This paper presents a fast and area-efficient implementation of a real-time stereo vision algorithm for spatial depth mapping. The design combines two well-known area-based approaches to stereo thatching and includes an occlusion detection method. Hardware efficiency is achieved by storing only partial images on-chip, avoiding full-sized frame buffers. A low-latency dataflow-oriented structure makes it possible to process 256 x 192 pixel Input streams with a rate In excess of 50 frames per second, amounting to more than 54 million pixel x disparity measurements per second (PDS) (for a 25-pixel disparity range), or roughly 18 GOPS. The design has been Integrated In a 0.25 mu m standard CMOS technology and occupies an area of less than 3 mm(2)

    Real-time high-sensitivity impedance measurement interface for tethered BLM biosensor arrays

    No full text
    This paper presents a switched-capacitor (SC) current integrator circuit for impedance measurement of tethered bilayer lipid membrane (tBLM) biosensors. The circuit comprises a small number of high performance components enabling enhanced experimental flexibility and reliability. The sensitivity is improved significantly by suppressing the output offset through pseudo-differential operation, using R-C components for the reference impedance. The sensing and reference electrodes are excited with low-amplitude differential voltage pulses and the current response to membrane resistance (RM) change of the tBLM biosensor is converted to voltage by a precision, low-noise SC integrator available as a single-package IC. Tests with both electrical models and actual biosensors demonstrated that the proposed circuit operates with high sensitivity and can be used in single chip versions for low-cost and high-sensitive tBLM biosensor arrays, featuring multiple electrode sites
    corecore